PROJECT OBJECTIVE: Apply a dimensionality reduction technique (PCA), train a model on the reduced data, and compare the results against a baseline.

1. Data Understanding & Cleaning:

A. Read 'vehicle.csv' and save it as a DataFrame.

B. Visualize a pie chart and print the percentage of values for the variable 'class'.
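A minimal sketch of this step. A tiny hand-made frame stands in for the real `vehicle.csv` (swap in `pd.read_csv('vehicle.csv')`), and the class proportions here are illustrative only:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Stand-in for pd.read_csv('vehicle.csv'); replace with the real file
df_vehicle = pd.DataFrame({'class': ['car'] * 6 + ['bus'] * 3 + ['van'] * 3})

# Percentage of each category in 'class'
class_pct = df_vehicle['class'].value_counts(normalize=True) * 100
print(class_pct.round(2))

# Pie chart of the class distribution
class_pct.plot.pie(autopct='%1.1f%%', ylabel='')
plt.title("Distribution of 'class'")
plt.savefig('class_pie.png')
```

`value_counts(normalize=True)` gives fractions summing to 1, so multiplying by 100 yields percentages directly.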

import pandas as pd

df_vehicle = pd.read_csv('vehicle.csv')

# One-hot encode the 'class' column and join it back to the DataFrame
one_hot = pd.get_dummies(df_vehicle['class'])
one_hot = one_hot.add_prefix('class_')  # e.g. 'class_car' rather than 'classcar'
df_vehicle = df_vehicle.join(one_hot)
df_vehicle.head()

Check for duplicate rows in the data and handle them appropriately (duplicate rows are typically dropped rather than imputed).
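A small sketch of the duplicate check on a toy frame standing in for the vehicle data:

```python
import pandas as pd

# Toy frame with one exact duplicate row (stand-in for df_vehicle)
df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [4, 5, 5, 6]})

print('duplicate rows:', df.duplicated().sum())

# Drop the duplicates and reset the index
df = df.drop_duplicates().reset_index(drop=True)
```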

2. Data Preparation:

A. Split the data into X and y (a train/test split is optional).

B. Standardize the Data.
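Steps A and B can be sketched as follows. Synthetic arrays stand in for the vehicle features and target; note that the scaler is fitted on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features/target standing in for the vehicle data
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
y = rng.integers(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```

After scaling, each training feature has mean 0 and standard deviation 1, which is what PCA and SVMs expect.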

3. Model Building:

A. Train a base classification model using SVM.

B. Print Classification metrics for train data.
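Steps A and B together, sketched on synthetic multi-class data in place of the scaled vehicle features. The parameters C=3, gamma=0.025, kernel='linear' are the ones the report quotes for the base model `svm_model_1`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the scaled vehicle features
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Base SVM with the report's stated parameters (gamma is ignored by a linear kernel)
svm_model_1 = SVC(C=3, gamma=0.025, kernel='linear')
svm_model_1.fit(X, y)

# Classification metrics on the training data
y_pred = svm_model_1.predict(X)
print(classification_report(y, y_pred))
print('train accuracy:', accuracy_score(y, y_pred))
```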

C. Apply PCA on the data with 10 components.

Before applying PCA, it is necessary to check for outliers and remove them, since PCA is sensitive to outliers.

D. Visualize Cumulative Variance Explained with Number of Components.

E. Draw a horizontal line on the above plot to highlight the threshold of 90%.
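Steps C-E can be sketched as below, again on synthetic data in place of the vehicle features; the 18-feature shape is an assumption for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the scaled vehicle features
X, _ = make_classification(n_samples=300, n_features=18, n_informative=8, random_state=0)
X = StandardScaler().fit_transform(X)

# PCA with 10 components, then cumulative variance explained
pca = PCA(n_components=10)
pca.fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, 11), cum_var, marker='o')
plt.axhline(y=0.90, color='r', linestyle='--')  # 90% threshold line
plt.xlabel('Number of components')
plt.ylabel('Cumulative variance explained')
plt.savefig('cum_var.png')
```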

As we see from the above graph, 90% of the variance in the data can be explained using 5 components.

F. Apply PCA on the data again, this time selecting the minimum number of components that explains 90% or more of the variance.

G. Train an SVM model on the components selected in the above step.
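A sketch of these two steps on synthetic stand-in data. Passing a float to `n_components` makes scikit-learn keep the fewest components reaching that variance fraction; C=5, gamma=0.1, kernel='rbf' are the values the report settles on:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the scaled vehicle features
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X = StandardScaler().fit_transform(X)

# A float n_components keeps the minimum components explaining >= 90% variance
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X)
print('components kept:', pca.n_components_)

# SVM on the retained components with the report's chosen parameters
svm_pca = SVC(C=5, gamma=0.1, kernel='rbf').fit(X_pca, y)
print('train accuracy:', accuracy_score(y, svm_pca.predict(X_pca)))
```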

We get the best parameters as C = 5 and gamma = 0.1 with the 'rbf' kernel. Hence, let us reconstruct the model with these parameters.

Print classification metrics for the train data of the above model and share insights.

4. Performance Improvement:

A. Train another SVM on the components obtained from PCA. Tune the parameters to improve performance.

B. Share best Parameters observed from above step.

C. Print classification metrics for the train data of the above model and share the relative improvement in performance across all the models, along with insights.
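The tuning in this section can be sketched with `GridSearchCV` over a small grid around the values the report mentions (C=7, gamma=0.025); the data here is again a synthetic stand-in, so the selected parameters will differ from the real run:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced vehicle features
X, y = make_classification(n_samples=200, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X_pca = PCA(n_components=0.90).fit_transform(StandardScaler().fit_transform(X))

# Small grid around the report's best values (C=7, gamma=0.025)
param_grid = {'C': [3, 5, 7], 'gamma': [0.01, 0.025, 0.1], 'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_pca, y)

print('best parameters:', grid.best_params_)
print('best CV accuracy:', grid.best_score_)
```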

The base SVM model, svm_model_1, has an accuracy score of 96% with parameters C=3, gamma=0.025, kernel='linear'. We then applied PCA, reduced the number of features to 5, and built a model with 97% accuracy. Finally, we tuned the hyperparameters of the SVM classifier, obtained the best parameters as C=7 and gamma=0.025, and maintained the 97% accuracy attained after PCA.

Explain pre-requisite/assumptions of PCA.

Principal component analysis (PCA, for short) is a variable-reduction technique that shares many similarities with exploratory factor analysis. Its aim is to reduce a larger set of variables to a smaller set of 'artificial' variables, called principal components, which account for most of the variance in the original variables. It is good practice to scale all features to a common scale before performing PCA.

PCA rests on the following prerequisites and assumptions:

1. The variables should be measured at the continuous level (although ordinal variables are frequently used as well).

2. There should be a linear relationship between the variables. PCA is based on Pearson correlation coefficients, which capture only linear relationships.

3. The data should be suitable for reduction: there need to be adequate correlations between the variables for them to be reduced to a smaller number of components.

4. There should be no significant outliers, as outliers can have a disproportionate influence on the results. One common rule of thumb (used, for example, by SPSS Statistics) is to treat component scores more than 3 standard deviations from the mean as outliers.

Explain advantages and limitations of PCA.

Advantages:

1. PCA efficiently removes correlated features, even when hundreds of features are present; after PCA, the resulting components are uncorrelated.

2. The training time of algorithms reduces significantly with fewer features, so if the input dimensionality is too high, using PCA to speed up the algorithm is a reasonable choice.

3. Overfitting often occurs when there are too many variables in the dataset, so PCA helps mitigate overfitting by reducing the number of features.

4. It is very hard to visualize and understand data in high dimensions. PCA transforms high-dimensional data into low-dimensional data so that it can be visualized easily.

Limitations:

1. After applying PCA, the original features are replaced by principal components, which are linear combinations of the original features. Principal components are not as readable and interpretable as the original features.

2. You must standardize the data before applying PCA; otherwise, PCA will not find the optimal principal components. For instance, if a feature set contains data expressed in kilograms, light years, or millions, the variance scales differ enormously. If PCA is applied to such a feature set, the loadings for high-variance features will dominate, biasing the principal components towards those features and leading to misleading results. In addition, for standardization, all categorical features must be converted to numerical features before PCA can be applied.

3. Although principal components try to cover the maximum variance among the features in a dataset, if the number of principal components is not selected with care, some information may be lost compared with the original list of features.